[RFC] Device Array and compat Device Vector #2660
Signed-off-by: Dan Hoeflinger <dan.hoeflinger@intel.com>
| Aspect | Proposed (oneDPL) | Thrust | sycl-thrust | SYCLomatic |
|---|---|---|---|---|
| **Default Allocator** | `device_allocator<T>` wrapping `sycl::malloc_device`; custom `DeviceAllocator` concept | `thrust::device_allocator<T>` (CUDA `cudaMalloc`) | `device_allocator<T>` (`sycl::malloc_device`); supports alignment template parameter | USM: `sycl::usm_allocator<T, shared>` / Buffer: `__buffer_allocator<T>` |
| **Memory Model** | **Device memory** via `sycl::malloc_device`; host access triggers explicit transfers | **Device memory** via `cudaMalloc`; host access triggers explicit transfers | **Device memory** via `sycl::malloc_device`; explicit transfers | **Shared memory** via USM shared or SYCL buffer/accessor; runtime manages placement |
Is "host access triggers explicit transfers" accurate? I mean, if something is triggered by another action, it is implicit and not explicit, no?
Rarely used in practice (see [usage study](usage_pattern_study.md)), high implementation complexity for device memory.
- **Host-side operations block but do not synchronize with prior work.**
What does "block" mean in this sentence, and is it important in the context of design decisions?
It means they wait() until the operation is complete before returning, but there is no guarantee that previously queued work finishes before this starts. This is mentioned because it is a departure from thrust::device_vector. I can make this clearer and provide better justification for why this decision was made (it fits better with SYCL, and the usage study shows users' preferences).
- Individual headers: `<oneapi/dpl/device_array>` and `<oneapi/dpl/device_vector>`. `device_vector` would transitively include `device_array` since it depends on it.
- We could have a `compat` header and an individual `device_array` header. However, if we intend to use `device_array` within our own SYCL implementations, that may impact our decision here.
This is not quite clear to me. What do you mean by "within our own sycl implementation", why would we want to use device_array there, and how does that impact the decision about headers?
## `device_span<T>`

`device_array` is not device-copyable (it owns memory). For kernel capture, non-owning views, and range composition, use `device_span<T>` via `.span()`.
I do not think our own span class is needed, because SYCL 2020 says that std::span and sycl::span, if supported by the implementation, must be device copyable. I suggest to remove device_span and the methods that return it.
Thanks for the pointer to this, I can look into this. I'd be happy to just rely on sycl::span.
template <typename T, typename Alloc = device_allocator<T>>
class device_array {
I would like to be convinced that the allocator template parameter is needed for this class. It is needed for device_vector, sure, but why should this simplified container allow for an alternative allocator?
The main argument from my perspective is that if we are building such a system for device_vector to have a "faithful" migration target, we will have and maintain that capability / complexity. If we have it, why not include it, and directly make device_vector depend upon device_array.
I think there are useful examples which can motivate having this feature, and it is easier to add at the beginning than later on: Memory pooling, alignment requirements, bookkeeping / debugging.
However, I'm open to the idea of having a third internal class which implements allocator friendly RAII memory management, and having both public classes hold a member of that internal class. device_array could hardcode a simple device_allocator with no frills. The question becomes whether we think there is enough utility in the feature to justify the API complexity. I don't have a strong opinion here really. The argument against is that the SYCL folks explicitly decided not to include an official usm device allocator.
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
Would it make sense to move towards range-based APIs, instead of iterator-based? A separate constructor from std::vector would not be necessary if the generic one accepts any range. For C++17, we may add a named requirement for a range that matches the C++20 concept, as well as a simple to-range converter for a pair of iterators.
// Device-to-device copy (allocates on the provided context+device)
// Supports cross-device copies: source and destination may be on different devices
static device_array copy_from(const device_array& src, sycl::queue q);
static device_array copy_from(const device_array& src,
                              size_type offset, size_type count, sycl::queue q);
static device_array copy_from(const device_array& src,
                              sycl::context ctx, sycl::device dev);
static device_array copy_from(const device_array& src,
                              size_type offset, size_type count,
                              sycl::context ctx, sycl::device dev);
These functions really look like a combination of the "uninitialized" constructor followed by an assignment, just with an additional offset argument. What are the benefits of those, or why are copy operations not appropriate?
akukanov left a comment
The device_array still seems over-engineered to me. I think it is better to start with a tiny, purposeful API and extend it on demand, rather than trying to include as much as possible from the beginning.
Actually, I would start with the use cases for this container (and code examples showing how it is supposed to be used), and design from those.
template <typename InputIt>
device_array(InputIt first, InputIt last, sycl::queue q);
What is the expected category of iterators/ranges that can be used as the source of data? The name suggests that the requirements are minimal (input_iterator), but then how are the data copied to the device memory underneath the container?
// Resize — new elements are uninitialized by default
void resize(size_type count);
void resize(size_type count, sycl::queue q);
// Resize — new elements filled with value
void resize(size_type count, const T& value);
void resize(size_type count, const T& value, sycl::queue q);
Does resize operate within the container capacity, or do you want it to be truly resizable?
This is thinking with the idea that device_vector needs this functionality, and device_array handles its memory. Similar to the question on allocators, we could have an internal class which handles this functionality for both classes and "turn off" these public features for a no-frills device_array API. I think that is probably the best option.
- **Should async overloads be in the initial proposal or deferred?** This provides more control over synchronization than merely an in-order queue, but it is unclear whether users who want this would just work with USM memory and memcpy directly.
In my opinion, everything that is outside of the "minimally viable" scope should be deferred. Async operations seem to be there, but some other operations might be as well.
Sure, we can cut this down to a minimal version.
// Single-element host access (blocking, creates queue from context & device)
T read(size_type pos) const;
void write(size_type pos, const T& value);

// Single-element host access (blocking, provided queue is used for copy submissions)
T read(size_type pos, sycl::queue q) const;
void write(size_type pos, const T& value, sycl::queue q);
I would call these host_{read,write} for clarity.
`device_array` is not device-copyable (it owns memory). For kernel capture, non-owning views, and range composition, use `device_span<T>` via `.span()`.
I do not quite understand why "it owns memory" means it is not device-copyable. The memory for data is clearly not embedded into the class layout, so the ownership is logical, not physical.
I can update this language, the technical reason is that a context is owned here. You are correct that the ownership is logical rather than physical. This class is not meant to be used directly on the device, but rather via a span or a raw pointer.
This class is not meant to be used directly on the device, but rather via a span or a raw pointer.
And this is what bothers me. We set usage restrictions based on the anticipated implementation details, while ideally it should be the opposite.
I really would like this container to be usable as a sized random-access range on the device, so that its use with oneDPL algorithms would be natural and require no special support. For that, it should be device copyable. For that, we should either avoid embedding types that are not device copyable, or just ignore their existence and claim that it is device copyable anyway, designing the rest in such a way that operations allowed on a device never touch the parts that are not supposed to be there.
Well, without some exterior system of storing contexts (in a global registry for instance), there are technical limitations within SYCL that make this challenging.
A (non device-copyable) context is the minimum requirement to do transfers to and from the device and to deallocate memory. So either we store the context here, or we must have a system for storing contexts globally and storing pointers, or searching through a stored list. I really don't like the idea of having such a global registry, but it is something we could do. Perhaps you have some other idea to get around this.
I don't think it's necessarily bad or difficult to have this as the RAII "owning" container, and to have lightweight non-owning ranges / iterators like a span or raw USM pointer when we want to pass them around for use in a kernel or a oneDPL API. I think this separation is a feature, not a burden, in that it separates the scope of ownership from the usage. The same is true of device_vector: it is not used directly in CUDA kernels, but rather after getting a non-owning device_ptr, or very commonly by extracting the raw pointer and using that directly.
I much prefer thinking of device_array more similar to a std::vector than to a std::shared_ptr with reference counting or something like that.
So either we store the context here, or must have a system for storing them globally and storing pointers, or searching through a stored list.
Why not just store it in host-allocated memory, though? As far as I understand, the extra overhead would not be on the hot path, and would likely be negligible compared to the overhead of allocating USM and transferring the actual data.
I much prefer thinking of device_array more similar to a std::vector
I fully agree. Now compare:
std::vector v(/*some args*/);
std::for_each(v.begin(), v.end(), lambda); // works
std::ranges::for_each(v, lambda); // works
tbb::parallel_for(0, v.size(), [&v](int i){ foo(v[i]); }); // works

dpl::device_array da(/*some args*/);
dpl::device_policy policy(q);
dpl::for_each(policy, da.begin(), da.end(), lambda); // works
dpl::ranges::for_each(policy, da, lambda); // does not work, requires span or alike
q.parallel_for(da.size(), [=](sycl::id<1> i){ foo(da[i]); }); // does not work, requires da.data()
Yes, your code example is compelling, but the TBB / vector combination does not include a host / device relationship where ownership of host-side entities becomes challenging. Also, your final example uses the subscript operator of device_array on the device, which the current design does not allow for (it omits this operator). I think offering a subscript operator on this container would be very confusing if it did not provide implicit host-device transfers similar to device_vector. My intention with device_array is that it provides more intentional and direct control over host vs device usage, while keeping most of the convenience of device_vector.
Please correct me if I misunderstand the suggestion, but from a technical standpoint, I think you are suggesting that when constructing a device_array we allocate some "host_state" struct which includes context, and only hold a pointer to that, allowing device_array to stay device-copyable. However, something eventually needs to destroy that host_state. Is it the destructor (after reference counting)?
Let's look at the rules for device copyable:
Type T has at least one eligible copy constructor, move constructor, copy assignment operator, or move assignment operator;
Each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is public;
The effect of each eligible copy constructor, move constructor, copy assignment operator, and move assignment operator is the same as a bitwise copy of the object;
Type T has a public non-deleted destructor; and
The destructor has no effect.
The copy and move constructors must be equivalent to bitwise copies, and the destructor should have no effect. We could maybe make this the case on the device by turning off aspects of the container with __SYCL_DEVICE_ONLY__, but I'm not sure it fits with device copyable. Even if we do, we are probably reference counting on the host side to see who is responsible for destroying the host_state and the device memory, which I'd prefer to avoid.
This is why I am in favor of having a simple host-side container for lifetime management, and a separate lightweight non-owning layer for actual usage (relying on existing usm pointer and sycl::span).
I prefer to see an object that is passed to the device as a shadow copy of the object on the host, rather than an independent copy participating in the lifetime management. In other words, no reference counting involved, and instead we say that the host instance of the class is solely responsible for lifetime management and should remain alive for as long as any device code may access the data in any possible way, including via the USM pointers. That is the same requirement you have to have anyway, just extended to the shadow copies of the object as well.
And with that documented semantic restriction, making a shadow copy is equivalent to a bitwise copy, and destroying a shadow copy has no effect. In other words, if the lifetime control conditions are met in the user code, the object is de facto device copyable - that is, a SYCL implementation can safely copy the bits back and forth as it wants.
This RFC proposes adding an experimental `device_array` - a RAII container for USM device memory. It gives users a focused interface for managing device allocations and provides the most-travelled usage patterns of `device_vector`.

The RFC also proposes adding a `compat` namespace and a `device_vector` within it. `device_vector` wraps `device_array` and provides a similar set of functionality to Thrust's `device_vector`.

Four documents:

- `device_array` RFC
- `device_vector` and helper classes RFC
- Study of `thrust::device_vector` and `dpct::device_vector` usages, includes analysis of alternatives and why they diverged from Thrust